
Nature Biotechnology

Springer Science and Business Media LLC

Preprints posted in the last 90 days, ranked by how well they match Nature Biotechnology's content profile, based on 147 papers previously published here. The average preprint has a 0.34% match score for this journal, so anything above that is already an above-average fit.

1
Carrierwave: A granular, incentive-aligned infrastructure for scientific communication

Bachelet, I.

2026-03-03 scientific communication and education 10.64898/2026.03.01.708795 medRxiv
Top 0.1%
41.9%

The peer-reviewed journal article imposes structural constraints on the dissemination, validation, and reuse of research outputs. Intermediate results, negative findings, methodological refinements, and replication attempts are systematically underrepresented in published literature, limiting visibility into ongoing research activity for both scientists and mission-driven funders. Here we present Carrierwave, an open infrastructure for continuous, granular scientific communication built on structured research objects (ROs), cryptographic provenance, blockchain-based attribution, and programmable incentive mechanisms. Each RO represents an atomic unit of scientific output -- a single experimental result, negative finding, dataset, protocol, or replication -- that is hashed for content integrity, stored in a persistent database, and optionally minted as an ERC-721 non-fungible token on the Ethereum blockchain. The system includes an on-chain bounty pool enabling funders to directly incentivize specific research activities, and an automated analysis layer that synthesizes disclosed ROs into continuously updated research landscape maps. We describe the system architecture, report on its implementation and deployment on Ethereum mainnet, and present a quantitative analysis of disease-specific publication frequency demonstrating the information latency problem that Carrierwave addresses. The distribution of publication frequency across disease areas is highly skewed, with the majority of conditions represented by fewer than four publications per year in high-impact biology journals. For diseases in the long tail, the interval between successive publications may span months or years. Publication frequency correlates poorly with disease burden, instead reflecting historical research community size and advocacy momentum. 
By reducing the unit of communication to the individual research object and eliminating editorial gatekeeping as a prerequisite for disclosure, Carrierwave increases the effective sampling rate of scientific activity in precisely the domains where publication-based visibility is most sparse. The system is live at https://carrierwave.org.
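The content-integrity step described above can be sketched as follows. This is a minimal illustration only: the abstract does not name the hash function, the serialization, or the RO schema, so SHA-256, canonical JSON, and every field name below are assumptions, and `research_object_digest` is a hypothetical helper rather than part of Carrierwave.

```python
import hashlib
import json


def research_object_digest(ro: dict) -> str:
    """Return a hex digest for a research object (hypothetical helper).

    Assumption: the object is serialized as canonical JSON (sorted keys,
    no extra whitespace) and hashed with SHA-256. The abstract does not
    say which scheme the real system uses.
    """
    canonical = json.dumps(
        ro, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative record; all field names are invented for this sketch.
example_ro = {
    "type": "negative_finding",
    "summary": "Compound X showed no effect in assay Y",
    "created": "2026-03-01",
}
digest = research_object_digest(example_ro)
print(digest)
```

Sorting the keys before hashing means two records with the same content but different key order map to the same digest, while any change to a value yields a different one.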

2
ESPeR-seq: Extremely Sensitive and Pure, End-to-end, RNA-seq library preparation

Chen, H.-M.; Kao, J.-C.; Yang, C.-P.; Tan, C.; Lee, T.; Sugino, K.

2026-03-15 genomics 10.64898/2026.03.12.711386 medRxiv
Top 0.1%
40.2%

The Smart-seq family of methods represents the gold standard for high-sensitivity, full-length single-cell RNA sequencing. Despite iterative improvements, fundamental challenges remain: the generation of non-specific PCR products that limit sensitivity, the inability to capture precise Transcription End Sites (TES), and the insidious generation of "phantom UMIs"--artificial molecular barcodes created during PCR that systematically inflate molecular counts. Here, we present ESPeR-seq, a novel architecture that resolves these barriers. To enable precise, stranded TES capture, we developed an "Omega-dT" primer that bypasses synthetic poly-T tracts, restoring high-quality sequencing directly at transcript termini. To eliminate both PCR background and phantom UMIs, we implemented a biochemical "multi-lock" mechanism utilizing uracil-containing TSOs and a uracil-intolerant DNA polymerase. We validate this approach using the logQ-slope, a novel metric that sensitively diagnoses UMI fidelity. Benchmarking reveals that while state-of-the-art methods still exhibit signs of UMI inflation, ESPeR-seq strictly prevents it. Furthermore, the strandedness and precise end-delineation provided by TSO and dT reads support robust de novo gene model reconstruction, enabling the discovery of novel multi-exon genes, unannotated 3′ UTR extensions, and candidate eRNAs across aggregated single-cell populations. Thus, ESPeR-seq establishes a robust framework for absolute quantitative accuracy and full-length isoform resolution.

3
Structure-Led Exploration of the Metagenome Yields Novel RNA-Guided Nucleases with Broad PAM Diversity

de los Santos, E. L.; Rieber, L.; Wang, M.; Catherman, S.; Hatfield, S.; Bowen, T.

2026-03-29 genomics 10.64898/2026.03.27.714800 medRxiv
Top 0.1%
34.2%

CRISPR-Cas bacterial adaptive immune systems use reprogrammable RNA guide sequences to specifically bind and cleave nucleic acids, and have been repurposed for easy and relatively efficient genomic editing. Despite the widespread use of Cas9 in biomedical research, its large size hinders AAV-mediated therapeutic delivery. Smaller RNA-guided nucleases could improve AAV gene therapy delivery, but their application is limited by their rarity among bacterial genomes and the restrictive sequence preferences of known systems, especially compared to the diversity of PAMs seen in the highly abundant Cas9 systems. Existing methods for identification of novel CRISPR subtypes rely on sequencing ever more bacterial genomes and comparing sequence homology. Using recent advances in protein structure prediction and comparison, we have identified and characterized proteins from known and novel compact RNA-guided nucleases and demonstrated that their PAM preference diversity meets or exceeds that of Cas9 systems or the compact IscB and TnpB systems. This discovery has enabled us to demonstrate editing in eukaryotic cells with multiple novel subtypes, which--together with their compact size, varied PAM sequences, and high specificity--makes them attractive tools for in vivo genome editing.

4
LoReMINE: Long Read-based Microbial genome mining pipeline

Agrawal, A. A.; Bader, C. D.; Kalinina, O. V.

2026-02-04 bioinformatics 10.64898/2026.02.02.703231 medRxiv
Top 0.1%
33.4%

Microbial natural products represent a chemically diverse repertoire of small molecules with major pharmaceutical potential. Despite the increasing availability of microbial genome sequences, large-scale natural product discovery remains challenging because the existing genome mining approaches lack integrated workflows for rapid dereplication of known compounds and prioritization of novel candidates, forcing researchers to rely on multiple tools that require extensive manual curation and expert intervention at each step. To address these limitations, we introduce LoReMINE (Long Read-based Microbial genome mining pipeline), a fully automated end-to-end pipeline that generates high-quality assemblies, performs taxonomic classification, predicts biosynthetic gene clusters (BGCs) responsible for biosynthesis of natural products, and clusters them into gene cluster families (GCFs) directly from long-read sequencing data. By integrating state-of-the-art tools into a seamless pipeline, LoReMINE enables scalable, reproducible, and comprehensive genome mining across diverse microbial taxa. The pipeline is openly available at https://github.com/kalininalab/LoReMINE and can be installed via Conda (https://anaconda.org/kalininalab/loremine), facilitating broad adoption by the natural product research community. Author summary: For decades, microbial natural products have been a major source of medicines, with most of the clinically used antibiotics being their derivatives. Recent advances in DNA sequencing technologies now allow the reconstruction of more complete and continuous microbial genomes, revealing a vast and largely untapped diversity of biosynthetic gene clusters responsible for natural product biosynthesis. Despite these advances, large-scale natural product discovery remains difficult because current genome mining approaches rely on many separate tools and lack an integrated workflow to dereplicate known compounds and prioritize novel biosynthetic pathways. 
To address these limitations, we introduce LoReMINE, an automated pipeline designed to simplify microbial genome mining directly from long-read sequencing data. LoReMINE integrates genome assembly, taxonomic classification, identification of biosynthetic gene clusters, and their clustering into gene cluster families within a single, reproducible workflow. This streamlined approach enables scalable analysis across diverse microbial taxa and facilitates comprehensive exploration of microbial biosynthetic potential. The pipeline is designed for both experimental and computational researchers, helping to advance natural product research and contribute towards the discovery of new therapeutic drugs.

5
Enabling Megascale Microbiome Analysis with DartUniFrac

Zhao, J.; McDonald, D.; Sfiligoi, I.; Lladser, M. E.; Patel, L.; Weng, Y.; Khatib, L.; Degregori, S.; Gonzalez, A.; Lozupone, C.; Knight, R.

2026-03-03 bioinformatics 10.64898/2026.03.01.708916 medRxiv
Top 0.1%
33.1%

We introduce a new algorithm, DartUniFrac, and a near-optimal implementation with GPU acceleration, up to three orders of magnitude faster than the state of the art and scaling to millions of samples (pairwise) and billions of taxa. DartUniFrac connects UniFrac with weighted Jaccard similarity and exploits sketching algorithms for fast computation. We benchmark DartUniFrac against exact UniFrac implementations, demonstrating that DartUniFrac is statistically indistinguishable from them on real-world microbiome and metagenomic datasets.
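For readers unfamiliar with the similarity measure named above, weighted Jaccard similarity between two non-negative vectors is the sum of element-wise minima divided by the sum of element-wise maxima. The sketch below computes it exactly for small inputs. It does not reproduce DartUniFrac: how the authors map phylogenetic branches to vector entries and which sketching algorithm approximates the ratio are not stated in the abstract, and `weighted_jaccard` is a name chosen here for illustration.

```python
from typing import Sequence


def weighted_jaccard(x: Sequence[float], y: Sequence[float]) -> float:
    """Exact weighted Jaccard similarity: sum(min) / sum(max).

    Both inputs must be non-negative and of equal length. Two all-zero
    vectors are treated as identical (similarity 1.0) by convention here.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("weights must be non-negative")
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    return 1.0 if denominator == 0 else numerator / denominator


# Toy abundance vectors for two samples (values are illustrative only).
sample_a = [1.0, 2.0, 0.0]
sample_b = [1.0, 1.0, 3.0]
similarity = weighted_jaccard(sample_a, sample_b)
distance = 1.0 - similarity
```

The exact computation is linear in the vector length per pair, so comparing millions of samples pairwise is where a sketching approximation of the kind the abstract mentions would come in.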

6
NanoSimFormer: An end-to-end Transformer-based simulator for nanopore sequencing signal data

Xie, S.; Ding, L.; Liu, L.; Zhu, Z.

2026-01-25 bioinformatics 10.64898/2026.01.20.700442 medRxiv
Top 0.1%
33.1%

Nanopore sequencing has achieved a new standard of accuracy with the advent of the R10.4.1 flow cell and high-performance Transformer-based basecalling models. However, existing signal simulators often fail to capture the complex, non-linear dynamics of nanopore current signals, relying on static pore models or lacking optimization objectives linked to basecalling, resulting in synthetic signals with substantially lower accuracy and fidelity than experimental data. To address this, we introduce NanoSimFormer, an end-to-end Transformer-based signal simulator that integrates basecaller guidance during training to generate high-fidelity nanopore signals explicitly optimized for accurate calling. Rigorous evaluation across diverse human, bacterial, and fungal R10.4.1 DNA sequencing datasets demonstrates that NanoSimFormer consistently outperformed competing methods (seq2squiggle and Squigulator), achieving median read accuracies exceeding 99% and Q-scores above 22.8, closely matching experimental baselines. NanoSimFormer faithfully recapitulated experimental variant calling performance on the human HG002 sample, achieving F1-scores of 0.9967 for SNPs and 0.8295 for small indels, and notably minimized false-positive errors in homopolymer and short tandem repeat (STR) regions where other simulators struggled. Furthermore, NanoSimFormer-derived reads enabled high-quality de novo bacterial assembly with consensus error rates below one mismatch per 100 kbp, comparable to experimental assemblies, and preserved fungal mock community structures with high correlation to experimental abundance profiles in metagenomic benchmarks. With tunable parameters for amplitude noise and event duration variance, NanoSimFormer enables the simulation of datasets spanning a wide range of data qualities. Together, these results establish NanoSimFormer as a robust tool for benchmarking and algorithm development in the latest nanopore sequencing era.

7
Sequencing depth overcomes extraction bias: repurposing human WGS data for salivary microbiome profiling

Velo-Suarez, L.; Herzig, A. F.; Bocher, O.; Le Folgoc, G.; Le Roux, L.; Delmas, C.; Zins, M.; Deleuze, J.-F.; Hery-Arnaud, G.; Genin, E.

2026-04-01 genomics 10.64898/2026.03.27.714786 medRxiv
Top 0.1%
32.3%

Large-scale human genomic projects have generated whole-genome sequencing (WGS) data from hundreds of thousands of individuals, primarily to study host genetic variation. When saliva is the DNA source, the resulting datasets also contain microbial reads that are routinely discarded. Here, we investigate whether these host-centric WGS workflows can yield reliable microbiome profiles, effectively doubling the research value of existing data without additional sampling. We compared non-human reads from 39 deeply sequenced saliva samples from the GAZEL cohort (miG dataset; median ~43 million reads/sample) with 14 samples processed with microbiome-optimized extraction (ASAL; median ~4.3 million reads/sample), using two complementary classifiers: meteor, a coverage-based mapper against a curated saliva-specific database, and sylph, a k-mer classifier against the Genome Taxonomy Database (GTDB). Despite the absence of microbial lysis optimization, miG samples showed up to 3-fold higher species richness, ~10-fold greater sequencing depth, and significantly lower inter-sample variability (PERMANOVA R² = 0.10, p = 0.001; BETADISPER p = 0.0036). Rarefaction to 10 reads eliminated most compositional differences, demonstrating that sequencing depth is the primary driver of community stability. Only ~2% of detected taxa (12 of 592) showed extraction-related differences. The two classifiers exhibited fundamentally different depth-sensitivity profiles, with sylph retaining systematic detection asymmetries even after depth normalization, highlighting that classifier choice introduces biases that affect cross-study comparisons. These results show that biobank WGS data from saliva can be repurposed for robust, population-scale oral microbiome analyses, enabling simultaneous investigation of host genomic variation and the microbiome from the same archived samples. 
Importance: Saliva-based whole-genome sequencing datasets generated across various cohorts to study human genetics contain non-human reads that are routinely discarded, thereby overlooking valuable microbial information. We show that these reads are sufficient to reconstruct robust oral microbiome profiles -- without any additional sampling or laboratory work. This finding unlocks a vast archive of existing genomic data for retrospective microbiome research, enabling population-scale studies of oral microbial diversity, host-microbiome interactions, and disease associations at minimal additional cost. We further demonstrate that the choice of taxonomic classifier introduces systematic, depth-dependent biases that persist even after normalization, a practical consideration for any cross-cohort or multi-platform microbiome study.

8
BenchDrop-seq: a microfluidics-free platform for benchtop single-cell long-read RNA sequencing

Bregman, J.; Nichols, C.; Ramisetti, R.; Srivastava, A.

2026-03-12 genomics 10.64898/2026.03.12.706999 medRxiv
Top 0.1%
28.3%

Single-cell long-read RNA sequencing enables direct measurement of full-length transcripts but has remained difficult to deploy at scale due to reliance on microfluidic barcoding, specialized instrumentation, and high per-cell cost. Here we present BenchDrop-seq, a benchtop platform for single-cell long-read transcriptomics that leverages particle-templated partitioning for single-cell molecular barcoding and couples this workflow to Oxford Nanopore sequencing for full-length transcript capture. By integrating established bead-based partitioning chemistry with long-read sequencing and a dedicated open-source analysis pipeline for barcode recovery, alignment, and transcript quantification, BenchDrop-seq enables isoform-resolved measurements from thousands of individual cells using standard laboratory equipment. We validate the platform in both a homogeneous cell line and a heterogeneous primary tissue, demonstrating high barcode recovery, accurate gene-level quantification, and reproducible detection of cell-type-specific transcript usage that is not readily accessible to short-read assays. Together, BenchDrop-seq establishes a practical and accessible framework for single-cell long-read RNA sequencing, lowering experimental barriers while enabling transcript-level analyses in routine single-cell experiments.

9
SWARM: A Single-Molecule Workflow for High-Precision Profiling of RNA Modifications

Prodic, S.; Cleynen, A.; Mahmud, S.; Srivastava, A.; Ravindran, A.; Kanchi, M.; Hajizadeh Dastjerdi, A.; Sethi, A. J.; Corovic, M.; Jain, R.; Guarnacci, M.; Santos-Rodriguez, G.; Vieira, G.; Weatheritt, R. J.; Hayashi, R.; Martinez, N. M.; Shirokikh, N. E.; Eyras, E.

2026-01-23 bioinformatics 10.64898/2025.12.18.695332 medRxiv
Top 0.1%
28.3%

Nanopore direct RNA sequencing promises to decode the epitranscriptome by detecting multiple modifications on individual RNA molecules, but its potential for biological discovery is hampered by high false-positive rates. We present SWARM, an AI-based framework designed to overcome this fundamental limitation. Its key innovation is a crosstalk-aware training strategy that incorporates non-target modifications and orthogonally validated cellular signals, enabling high-precision detection of m6A, pseudouridine (Ψ), and m5C at single-nucleotide and single-molecule resolution. Using rigorous in vitro and cellular RNA benchmarks, SWARM outperforms existing tools and maintains strong agreement with orthogonal methods. Applying SWARM across mammalian tissues reveals thousands of novel modification sites with confirmed motifs and localisation patterns. Our high-resolution multi-tissue modification map revealed no evidence of widespread m6A-Ψ interplay, challenging models of a coordinated epitranscriptomic code. We further discovered a previously unrecognised splicing-shaped mode of Ψ deposition, whereby TRUB1-mediated pseudouridylation preferentially occurs after exon-exon ligation, consistent with local RNA structure stabilisation. SWARM provides a robust, universally applicable tool for epitranscriptome discovery.

10
Adaptive sampling-based enrichment enables genome reconstruction of intracellular symbionts despite host background and reference divergence

Huang, W.-K.; Yang, C.-H.; Chung, H.; Lee, Y.-C.; Wu, Y.-C.; Chen, Y.-T.; Wan, M.-H.; Yeh, W.-S.; Hong, Y.-P.; Wu, T.-H.; Li, J.-C.; Liu, W.-L.; Chen, C.-H.; Chen, Y.-T.

2026-03-27 genomics 10.64898/2026.03.25.714109 medRxiv
Top 0.1%
28.1%

Recovering genomes of intracellular microbes from host-dominated samples remains a major challenge in microbial genomics, due to low target abundance, overwhelming host DNA, and the inability to culture these organisms independently. Despite extensive interest in Wolbachia, efficient genome recovery directly from host tissues remains limited by the inefficiency of host-dominated sequencing and the constraints of existing enrichment strategies. Here, we demonstrate that Oxford Nanopore adaptive sampling (AS) enables efficient, real-time enrichment of target DNA directly from complex host tissues, providing a culture-free approach for genome recovery in such systems. To our knowledge, this represents the first application of enrichment-mode adaptive sampling to achieve de novo reconstruction of an intracellular endosymbiont genome in a mosquito system. Using Aedes aegypti mosquitoes infected with a locally derived wAlbB-like strain, we applied enrichment-mode AS to selectively sequence Wolbachia DNA. This resulted in an increase from <1% Wolbachia reads in conventional shotgun data to ~90% under adaptive sampling. De novo assembly of AS-enriched long reads yielded a near-complete genome (~1.5 Mb) in two contigs with >96-99% completeness. Comparative analyses revealed multiple large-scale chromosomal rearrangements relative to the reference wAlbB genome, demonstrating that adaptive sampling does not impose reference-dependent genome structure. Annotation further identified three prophage-associated regions, including two strain-specific expansions absent from the reference genome. Notably, cytoplasmic incompatibility genes (cifA and cifB) were identified adjacent to one of these regions, consistent with their known genomic association with prophage elements. 
Importantly, adaptive sampling remained effective despite substantial structural divergence between the reference and target genomes, revealing an unexpectedly robust application of this approach beyond its presumed operating conditions. Together, these results establish enrichment-mode adaptive sampling as a robust and scalable strategy for genome-resolved analysis of intracellular bacteria in host-associated systems.

11
Scalable Microbiome Network Inference: Mitigating Sparsity and Computational Bottlenecks in Random Effects Models

Roy, D.; Ghosh, T. S.

2026-03-31 bioinformatics 10.64898/2026.03.27.714858 medRxiv
Top 0.1%
27.8%

The application of Large Language Models (LLMs) and Transformers to biological and healthcare datasets requires the extraction of highly accurate, noise-filtered ecological networks. The Random Effects Model (REM) is a powerful statistical method for inferring microbial interaction networks and identifying keystone species across heterogeneous studies. However, existing implementations in R that rely on single-threaded "Iteratively Reweighted Least Squares" (IRLS) are computationally prohibitive for high-dimensional metagenomic data, creating a significant bottleneck for downstream machine learning pipelines. In this paper, we present Parallel-REM, a highly scalable, Python-based parallel pipeline accelerating large-scale network inference. By integrating robust variance filtering, sparsity checks, and a batched Master-Worker parallelisation strategy using joblib and statsmodels, we resolve native convergence failures associated with sparse biological matrices. Benchmarking on a massive clinical dataset comprising 70,185 samples and 466 optimal species demonstrates a 26.1x speedup over sequential baselines on a 64-core architecture, reducing computation time from days to minutes. Furthermore, statistical validation shows > 99.9% directional concordance with the original R implementation. Parallel-REM democratises large-scale network extraction, providing the high-throughput infrastructure necessary to feed clean topological and biological features into modern deep learning and Transformer-based diagnostic architectures.

12
STCS: A Platform-Agnostic Framework for Cell-Level Reconstruction in Sequencing-Based Spatial Transcriptomics

Chen Wu, L.; Hu, X.; Zhan, F.; Sun, C.; Gonzales, J.; Ofer, R.; Tran, T.; Verzi, M. P.; Liu, L.; Yang, J.

2026-03-02 bioinformatics 10.64898/2026.02.26.708370 medRxiv
Top 0.1%
27.8%

Sequencing-based spatial transcriptomics technologies, including Visium HD and Stereo-seq, now enable transcriptome-wide profiling at subcellular resolution. However, these platforms generate measurements over spatially barcoded units rather than biologically segmented cells, creating a fundamental bottleneck for cell-centric analysis and interpretation. Robust reconstruction of coherent single-cell transcriptomes from high-density spatial bins remains an unresolved computational challenge. Here we present STCS (Spatial Transcriptomics Cell Segmentation), a platform-agnostic framework that reconstructs cell-level gene expression profiles by integrating nuclei segmentation with a joint transcriptomic-spatial distance model. STCS is governed by two interpretable parameters and incorporates a reference-free parameter selection strategy based on internal stability and spatial coherence metrics, enabling adaptable deployment across tissue types and technologies without requiring matched ground-truth annotations. We benchmark STCS on a Visium HD human lung cancer dataset with matched Xenium-derived cell segmentation, enabling direct cell-level validation, and on high-resolution Stereo-seq mouse brain data to assess cross-platform generalizability. Across multiple evaluation dimensions--including cell-type agreement, spatial organization, gene-expression fidelity, and compositional accuracy--STCS achieves consistent improvements over existing methods while preserving biologically coherent spatial structure. As sequencing-based spatial transcriptomics is rapidly adopted across biomedical research, STCS provides a broadly applicable and open-source solution for reconstructing cell-resolved transcriptomes, facilitating more reliable downstream analyses and cross-platform integration.

13
MERFISH 2.0, an ultra-sensitive single-cell spatial transcriptomics imaging chemistry across diverse tissue types

He, J.; He, L.; Wang, B.; Wiggin, T.; Chen, R.; Wang, H.; Yang, B.; Tattikota, S. G.; Maziashvili, L.; Zhang, T.; Revuru, S.; Wang, S.; Patil, S.; Sun, Y.; Sun, Y.; Li, M.; Cai, Y.; Wu, L.; Pentrenko, N.; Vasaturo, A.; Ray, M.; Emanuel, G.

2026-03-06 genomics 10.64898/2026.03.06.710199 medRxiv
Top 0.1%
26.9%

Spatial transcriptomics has emerged as a transformative approach for elucidating tissue architecture, cellular heterogeneity, and disease mechanisms by preserving the spatial context of gene expression in cells. Despite these advances, many spatial transcriptomic methods underperform in archival or clinically relevant specimens, particularly formalin-fixed, paraffin-embedded (FFPE) tissues, where RNA degradation and crosslinking hinder transcript detection. To address these challenges, we developed Multiplexed Error Robust Fluorescence In Situ Hybridization 2.0 (MERFISH 2.0), an optimized spatial transcriptomic imaging chemistry to enhance profiling of fragmented and highly crosslinked RNA. Across diverse human and mouse tissues preserved as fresh-frozen, fixed-frozen, and FFPE specimens, MERFISH 2.0 substantially increased transcript detection sensitivity by up to ~8-fold relative to MERFISH 1.0 while preserving quantitative concordance (Pearson r ≥ 0.8 across tissues). In archived fresh-frozen human brain samples, MERFISH 2.0's enhanced sensitivity improved transcript recovery, cell type resolution, and spatial analyses. In a low-quality archival FFPE human breast cancer specimen, MERFISH 2.0 revealed additional cell populations, novel cell clusters, refined tumor-immune architecture, and increased detection of gene-gene and cell-cell interactions relative to MERFISH 1.0, underscoring the impact of improved sensitivity on downstream spatial analysis. By substantially expanding robust transcript detection to degraded and archival samples, MERFISH 2.0 enables scalable, cohort-level spatial transcriptomic analysis across clinically relevant tissue collections.

14
A modular transcript enrichment strategy for scalable, atlas-aligned, and clonotype-resolved single-cell transcriptomics

Vaikunthan, M.; Schoonen, A. C. M.; Lia, I.-T.; Guo, J. H.; McFaline-Figueroa, J. L.

2026-02-06 genomics 10.64898/2026.02.06.703342 medRxiv
Top 0.1%
25.7%

Existing targeted approaches either rely on per-gene barcoded probes and transcript tiling, which limit scalability, or forgo reverse transcription, precluding capture of highly variable transcripts. Here, we present targeted reverse transcription-linker (TRTL), a modular method that can be integrated into existing single-cell transcriptomic workflows to enable targeted readout of user-defined transcript panels, with minimal changes in protocol when scaling from tens to thousands of genes. Because TRTL retains reverse transcription, it also enables the capture of variable transcripts, such as TCR and BCR sequences. Applying TRTL and combinatorial indexing RNA-seq to the mouse brain, we show that carefully designed panels support robust alignment to existing reference atlases, enabling accurate cell type annotation and detection of cellular populations at low sequencing depths. Lastly, we combine TRTL with nuclear hashing-based multiplexing for a targeted-sci-Plex protocol and further demonstrate that targeted-sci-Plex can resolve dynamic T-cell fate trajectories following diverse activating exposures while concurrently profiling T-cell clonotypes.

15
REMAG: recovery of eukaryotic genomes from metagenomic data using contrastive learning

Gomez-Perez, D.; Raguideau, S.; Warring, S.; James, R.; Hildebrand, F.; Quince, C.

2026-03-08 bioinformatics 10.64898/2026.03.05.709928 medRxiv
Top 0.1%
25.7%

Metagenome-assembled genomes (MAGs) are central to exploring microbial communities. Yet, despite the relevance of protists and fungi to diverse ecosystems, eukaryotic MAG recovery lags behind that of prokaryotes. A major bottleneck is that most state-of-the-art binning pipelines exclusively rely on prokaryotic single-copy core gene reference databases and are optimized for smaller genomes. To address this gap, we present REMAG (Recovery of Eukaryotic MAGs), a tool designed to recover high-quality eukaryotic genomes suited for long-read metagenomic data. REMAG leverages fine-tuned HyenaDNA genomic foundation models to efficiently filter eukaryotic contigs. It then employs a dual-encoder Siamese network trained with Barlow Twins contrastive loss to learn a shared embedding space by integrating contig composition and differential coverage. Finally, high-quality bins are extracted using greedy iterative Leiden clustering optimized with eukaryotic single-copy core gene constraints. In benchmarks based on simulated mixed prokaryotic/eukaryotic communities and real datasets of varying sizes and origin, we demonstrate REMAG's ability to recover more near-complete eukaryotic genomes than existing state-of-the-art tools, which often produce highly fragmented eukaryotic bins. REMAG provides an automated eukaryotic binning method that scales effectively with the increasing size and sequencing depth of metagenomic datasets.

16
ESGI: Efficient splitting of generic indices in single-cell sequencing data

Stohn, T.; van de Brug, N. D.; Theodosiadou, A.; Thijssen, B.; Jastrzebski, K.; Wessels, L. F. A.; Bosdriesz, E.

2026-03-06 bioinformatics 10.64898/2026.03.04.709594 medRxiv
Top 0.1%
25.2%

Single-cell sequencing technologies increasingly rely on complex nucleotide barcoding schemes to encode cellular identities, experimental conditions, and multiple molecular modalities within a single experiment. While demultiplexing, alignment, and UMI-based quantification form the core preprocessing steps that transform raw sequencing reads into analyzable single-cell data, existing pipelines are often tightly coupled to specific experimental designs and typically assume fixed barcode positions and substitution-only error models. As a result, many emerging assays employing combinatorial, variable-length, or multimodal barcoding designs require custom, hard-coded preprocessing solutions that are difficult to generalize and maintain. Here, we present ESGI (Efficient Splitting of Generic Indices), a flexible and extendable framework for demultiplexing and processing single-cell sequencing data with arbitrary barcode architectures. ESGI operates directly on raw FASTQ files using a generic barcode pattern specification, supports barcode matching with insertions and deletions via Levenshtein distance, accommodates variable-length barcodes, and provides detailed quality metrics for barcode assignment. ESGI optionally integrates genome alignment via STAR and performs feature quantification and UMI collapsing to generate cell-by-feature count matrices. ESGI is well documented and readily applicable to novel single-cell experiments. We demonstrate the versatility of ESGI across six datasets spanning four distinct single-cell technologies, including combinatorial indexing-based transcriptomic and multimodal assays, feature barcode-based protein measurements, and spatial barcoding data. Across these applications, ESGI robustly demultiplexes complex barcode designs that are not natively supported by existing pipelines, while producing results comparable to established workflows where applicable. 
Together, ESGI provides a general and future-proof solution for preprocessing single-cell sequencing data, enabling rapid adoption and analysis of emerging experimental designs.
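
As an illustration of the barcode matching idea mentioned in this abstract, the following is a minimal sketch of assigning an observed barcode to a whitelist using Levenshtein distance, which tolerates insertions and deletions as well as substitutions. It is not ESGI's implementation: the function names `levenshtein` and `assign_barcode`, the `max_dist` threshold, and the rule of rejecting tied matches are all assumptions made for this example.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca from a
                            curr[j - 1] + 1,      # insert cb into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def assign_barcode(observed: str, whitelist: list[str],
                   max_dist: int = 1) -> Optional[str]:
    """Return the single closest whitelist barcode within max_dist, else None.

    Hypothetical helper: the threshold and tie handling are illustrative.
    """
    scored = sorted((levenshtein(observed, bc), bc) for bc in whitelist)
    if not scored or scored[0][0] > max_dist:
        return None
    # Treat a read as ambiguous if two barcodes share the best distance.
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None
    return scored[0][1]
```

With the whitelist `["ACGTAC", "TTGGCC"]`, the read `"ACGAC"` (one deleted base) and the read `"ACGGTAC"` (one inserted base) are both assigned to `"ACGTAC"`, cases that a substitution-only (Hamming) comparison cannot handle because the lengths differ.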

17
GxP Single-cell RNA-seq and Spatial Transcriptomics end-to-end pipeline for clinical research

Zaratiegui, A.; Burfield, T.; Povlsen, H. R.; Sola, M. E. G.; Czaban, A.; Soh, K.; Das, V.

2026-01-26 bioinformatics 10.64898/2026.01.23.701261 medRxiv
Top 0.1%
23.5%

Single-cell/nucleus RNA-sequencing and Spatial Transcriptomics are powerful tools for investigating cellular heterogeneity and tissue architecture that have deepened our understanding of disease. Their broader adoption in clinical and regulated settings, however, is hindered by challenges related to data integrity, regulatory compliance, reproducibility, and scalability. To address this gap, we developed NNclinSSOAP (Novo Nordisk Clinical Single-cell Spatial Omics Analytical Pipeline), a modular, GxP-ready end-to-end computational pipeline that combines established single-cell workflows with a new Nextflow pipeline for Spatial Transcriptomics. NNclinSSOAP transforms RNA sequencing and Xenium spatial data into integrated, annotated single-cell objects and spatially resolved tissue maps. Designed to support mechanistic studies and clinical endpoint generation, it enables traceable and reproducible processing of large-scale datasets and scales to both local and HPC environments. Here, we provide a step-by-step guide for using NNclinSSOAP. All code and data are publicly available. Using a standard laptop, the pipeline can be executed within 1.5 hours.

18
PETRA: Prime editing of transcribed regulatory elements to assay expression

Reyes, M. A.; Herger, M.; Cubitt, L.; Findlay, G. M.

2026-01-24 genomics 10.1101/2025.11.18.689114 medRxiv
Top 0.1%
23.4%

Predicting how changes in human DNA sequence impact gene expression remains challenging. Here, we present PETRA (Prime Editing of Transcribed Regulatory elements to Assay expression), a multiplexed genome editing method to quantify the effects of regulatory variants at scale. PETRA leverages the delivery of variants to abundantly transcribed regions of genes such that sequence-specific effects on mRNA expression can be read out by amplicon sequencing. We demonstrate PETRA in Jurkat cells by scoring 13,935 six-nucleotide insertions delivered to the 5′ untranslated regions (5′ UTRs) of four genes important for T cell responses, namely VAV1, IL2RA, CD28 and OTUD7B. Effects on expression are linked to the creation of new transcription factor binding sites (TFBSs), as well as to alterations in splicing and translation initiation. Combinatorial delivery of TFBSs identified using PETRA generates alleles that increase mRNA expression more than 10-fold. Additionally, we extend PETRA to primary human T cells to compare effects across cell types and use our data to assess the performance of computational models. These results establish PETRA as a flexible means of dissecting and reprogramming the logic of gene regulation across genomic contexts and cell types.

19
Minimum Unique Substrings as a Context-Aware k-mer Alternative for Genomic Sequence Analysis

Adu, A. F.; Menkah, E. S.; Amoako-Yirenkyi, P.; Pandam Salifu, S.

2026-03-03 bioinformatics 10.64898/2026.02.28.708734 medRxiv
Top 0.1%
23.1%

Fixed-length k-mers have long been the standard in sequence analysis. However, they impose a uniform resolution across heterogeneous genomes, often resulting in significant redundancy and a loss of contextual sensitivity. To address these limitations, we introduce Minimum Unique Substrings (MUSs), which are variable-length sequence units that adapt to the local complexity of the genome. MUSs function as context-aware markers that naturally define repeat boundaries by extending only until uniqueness is achieved. We build upon the theoretical relationship between MUSs and maximal repeats, extending this framework to sequencing reads by establishing a read-consistent definition of uniqueness. We present a linear-time (O(n)) algorithm based on a generalized suffix tree and introduce the concept of "outposts." These outposts act as anchors for uniqueness, enabling precise localization of MUS boundaries within the sequencing data. Empirical studies of E. coli K-12 and human HiFi reads reveal distinct distributions in MUS lengths that reflect their respective genomic architectures. The compact bacterial genome produces a highly dense set of MUSs with a narrow length distribution (averaging 30.44 bp). In contrast, the repeat-rich human genome requires longer substrings to resolve uniqueness, resulting in an increased mean length (36.08 bp) and a broader distribution that delineates complex repetitive elements. The MUS framework achieves 100% unique coverage with an average length of 36.08 bp, surpassing the 69% coverage of k = 61. By reducing the total number of tokens by over 99%, it provides higher resolution and superior data compression compared to fixed-length k-mer sampling. These results demonstrate that MUSs provide a biologically meaningful, context-sensitive alternative to k-mers, with direct applications in genome assembly, repeat characterization, and comparative genomics.
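
To make the notion of a minimum unique substring concrete, here is a small brute-force sketch on a single string. It assumes the common definition that a MUS occurs exactly once while every proper substring of it occurs more than once (the empty string counts as repeated). This is not the linear-time suffix-tree algorithm described in the abstract, and it does not model reads or "outposts"; the function names are hypothetical and the approach only suits toy inputs.

```python
from collections import Counter


def count_substrings(text: str) -> Counter:
    """Count every occurrence (overlaps included) of every substring."""
    counts: Counter = Counter()
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            counts[text[i:j]] += 1
    return counts


def minimum_unique_substrings(text: str) -> list[tuple[int, str]]:
    """Return (start, substring) pairs for all MUSs of text.

    Brute force with quadratic memory; for illustration only.
    """
    counts = count_substrings(text)
    result = []
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = text[i:j]
            if counts[sub] == 1:
                # sub is the shortest unique substring starting at i, so its
                # prefix sub[:-1] is repeated (or empty). It is a MUS only if
                # the suffix sub[1:] is also repeated (or empty).
                suffix = sub[1:]
                if suffix == "" or counts[suffix] > 1:
                    result.append((i, sub))
                break  # longer extensions from i cannot be minimal
    return result
```

For example, in `"ACGACG"` the only MUS is `"GA"` at position 2: `"CGA"` and `"ACGA"` are also unique, but each contains the shorter unique substring `"GA"`, so extension stops as soon as uniqueness is reached.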

20
Fixative eXchange (FX)-seq: Scalable Single-nucleus RNA Sequencing Analysis of PFA-fixed or FFPE Tissue

Park, H.-E.; Lee, Y. T.; Lee, J.; Ji, H.; Song, Y.-L.; Lee, J. W.; Kim, S.-Y.; Hur, J. K.; Kim, E.; Lee, C. W.; Han, Y. D.; Kim, H.; Sohn, C. H.

2026-03-07 genomics 10.64898/2026.03.05.709668 medRxiv
Top 0.1%
23.1%

Single-nucleus RNA sequencing (snRNA-seq) of clinical formalin-fixed, paraffin-embedded (FFPE) samples has long been a challenge due to low reverse transcription (RT) yields. Here, we present Fixative-eXchange (FX)-seq, a highly scalable snRNA-seq method for heavily paraformaldehyde (PFA)-fixed and/or FFPE samples. We employ an organocatalyst to facilitate the removal of PFA crosslinks to increase RT yield and additional regiospecific Pt(II)-based crosslinking of RNA molecules to prevent leakage. FX-seq reveals cellular heterogeneity across multiple fixed samples by analyzing 321,710 nuclei, including PFA-fixed tissue, FFPE blocks, thin FFPE and hematoxylin and eosin (H&E)-stained sections from mouse brain and human cancer specimens such as gastrointestinal stromal tumor and colorectal cancer. FX-seq enables integrated analysis with pathologist annotation to label tumor and non-tumor regions of H&E-stained sections. FX-seq can also be applied to PFA-perfusion-based animal studies, large human cohort studies, and personalized drug treatment through precision medicine.